Abstract: Cloud computing providers are now offering their
unused resources for leasing in the spot market, which has been 
considered the first step towards a full-fledged market economy 
for computational resources. Spot instances are virtual machines 
(VMs) available at lower prices than their standard on-demand 
counterparts. These VMs will run for as long as the current 
price is lower than the maximum bid price users are willing 
to pay per hour. Spot instances have been increasingly used for 
executing compute-intensive applications. In spite of an apparent
economic advantage, the intermittent nature of biddable resources
means that application execution times may be prolonged, or
applications may not finish at all. This paper proposes a resource allocation
strategy that addresses the problem of running compute-intensive 
jobs on a pool of intermittent virtual machines, while also 
aiming to run applications in a fast and economical way. To 
mitigate potential unavailability periods, a multifaceted fault- 
aware resource provisioning policy is proposed. Our solution 
employs price and runtime estimation mechanisms, as well as 
three fault tolerance techniques, namely checkpointing, task 
duplication and migration. We evaluate our strategies using 
trace-driven simulations, which take as input real price variation 
traces, as well as an application trace from the Parallel Workload 
Archive. Our results demonstrate the effectiveness of executing 
applications on spot instances, respecting QoS constraints, despite 
occasional failures. 

I. Introduction 

Variable pricing virtual machines (also known as "spot instances"¹)
are increasingly being employed as a means of accomplishing various
computational tasks, including high performance parallel processing
tasks, which are common in several areas of science, such as climate
modeling, drug design, and protein analysis, as well as in data
analytics scenarios, such as execution of MapReduce tasks [1].
Significant cost savings and the possibility of easily leasing extra
resources when needed are major considerations when choosing virtual
clusters, dynamically assembled out of cloud computing resources,
over a local HPC cluster [2].

The cloud computing spot market, since introduced by Amazon Web
Services [3], [4], has been considered the first step towards a
full-fledged market economy for computational resources [5]. In this
market, users submit a resource leasing request that specifies a
maximum price (bid) they are willing to pay per hour for a predefined
instance type. Instances associated with that request will run for as
long as the current spot price is lower than the specified bid.
Prices vary frequently, based on supply and demand. Prices are
distinct and vary independently for each available datacenter
("availability zone" in Amazon terminology), spot instance type, and
operating system choice. Not all type/OS combinations are available
in all datacenters. In other words, there are multiple spot markets
from which to choose suitable computational resources, making the
provisioning problem significantly challenging.

¹The terms "spot instance", "instance", "virtual machine", "VM", and
"resource" signify the same concept and are used interchangeably in
this work.

When an out-of-bid situation occurs, i.e. the current spot 
price for that instance type goes above the user's maximum 
bid, instances are terminated by the provider without prior no- 
tice. Therefore, in spite of an apparent economical advantage, 
an intermittent nature is inherent to biddable resources, which 
may cause VM unavailability. 

Despite the possibility of failures due to out-of-bid situations, as
we have discussed in our previous work [2], it is advantageous to
utilize spot instances to run compute-intensive applications at a
fraction of the price that standard fixed-price VMs would normally
cost. Specifically, we
have demonstrated the effect of different runtime estimation 
methods on the decision-making process of a dynamic job 
allocation policy. Our policy was responsible for requesting 
and terminating spot instances on-the-fly as needed by a stream 
of computational jobs, as well as choosing the best instance 
type for each job based on the estimated job execution time 
on each available type. 

We had previously assumed that users would bid high enough so that
the chance of spot instance failures due to out-of-bid situations
would be negligible. In reality, even though users only pay the
current spot price at the beginning of each hour, regardless of the
specified bid, there are incentives for bidding lower. Andrzejak et
al., who evaluated checkpointing techniques for spot instance fault
tolerance, observed that by bidding low, significant cost savings can
be achieved, but execution times increase significantly. Similarly,
by increasing the budget slightly, execution times can be reduced by
a large factor [6].

A. Bidding strategies and the need for fault tolerance 

We now elaborate on the potential risks and rewards of provisioning a
resource pool composed exclusively of spot instances in scenarios
where QoS constraints play an important role.

Failures due to out-of-bid situations may lead to the inability to
provide the desired quality of service, e.g. prolonged application
execution times or an inability of applications to finish within a
specified deadline. To overcome this uncertainty, one may adopt
strategies that decrease the chance of failures or mitigate their
effects.

To decrease the chance that out-of-bid situations occur, one could
choose to bid as high as possible. Given that, under the current
model of Amazon spot instances, users pay at most the current spot
price (not the actual bid), there would be no apparent disadvantage
in bidding much higher than the spot price. However, there are
incentives for adopting more aggressive bidding strategies, i.e.
bidding close to or even lower than the current spot price.

Firstly, Amazon offers on-demand instances at a fixed price, which
are functionally identical to spot instances and are not subject to
terminations due to pricing issues. The price set by Amazon for these
on-demand instances is likely to influence the maximum price a user
is willing to bid. Thus, this value acts as an upper bound for bids
of users that would rather lease a reliable on-demand instance in
cases where the spot price is equal to or above the on-demand price.
In fact, by analysing the history of spot prices of Amazon EC2, we
have observed that, over the period of about 100 days from
05-Jul-2011 to 15-Oct-2011, spot prices have surpassed on-demand
prices several times across most instance types and datacenters. For
example, the spot price of one of the most economical instances
(M1.SMALL) in the US-EAST region has reached this situation 11 times,
for periods of up to 2 hours and 20 minutes, with prices of up to 17%
above the on-demand price.

Secondly, in a scenario where most users submit high bids, providers
would likely increase the spot price to maximize profits. As
previously postulated [7], the Amazon EC2 spot market resembles a
Vickrey-style auction [8], where users submit sealed bids, the
provider gathers them and computes a clearing price. The pricing
scheme thought to be used by Amazon, where all buyers pay the
clearing price, is a generalization of the Vickrey model for multiple
divisible goods, the standard uniform price auction, in which the
provider assigns resources to users starting with the highest bidder,
until all bids are satisfied or there are no more resources. The
price paid by all users is the value of the lowest winning bid
(sometimes, the highest non-winning bid) [5]. It has been
observed that this scheme is a truthful auction, provided that
the supply level is adjustable ex post (i.e. after the bids have 
been decided) 0. It has also been observed that Amazon may 
be artificially intervening in the prices by setting a reserve
price and generating prices at random [9]. In any case, we
argue that there is an incentive for users to submit fair bids, 
based on the true value they are willing to pay for the resource. 
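
As an illustration of the uniform price auction described above, the
following Python sketch serves the highest bidders first and charges
every winner the lowest winning bid. The function name, the
one-unit-per-bid simplification, and the sample bids are assumptions
made for illustration only; it is not a description of how Amazon
actually computes spot prices.

```python
from typing import List, Optional, Tuple


def clear_uniform_price_auction(
    bids: List[float], supply: int
) -> Tuple[List[float], Optional[float]]:
    """Return (winning bids, clearing price) for sealed single-unit bids.

    Assumptions: every bid requests exactly one unit, and the clearing
    price is the lowest winning bid (one of the two variants mentioned
    in the text).
    """
    # Serve the highest bidders first until the supply is exhausted.
    ranked = sorted(bids, reverse=True)
    winners = ranked[: max(supply, 0)]
    # Without winners there is no clearing price.
    price = winners[-1] if winners else None
    return winners, price
```

For instance, with bids of 0.05, 0.12, 0.08 and 0.03 and two units on
offer, the two highest bidders win and both pay 0.08, regardless of
how much higher the top bid was.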

Thirdly, on a similar note, users may choose to postpone non-urgent
tasks when prices are relatively high, hoping to obtain a lower price
(the true value) later, a strategy that can be accomplished by
placing a bid at the desired price and waiting for it to be
fulfilled. Similarly, in the case of an out-of-bid situation, owners
of a non-urgent task would prefer to wait for the request to be
in-bid again, rather than obtaining a new resource under new lease
terms (e.g. another VM type, or the same type at a higher bid).

Finally, as observed by Yi et al. [10], one can bid low to take
advantage of the fact that the provider does not charge the partial
hour that precedes an out-of-bid situation. Thus, by delaying the
termination of an instance to the next hour boundary, even when it is
no longer needed, one keeps a chance that a failure occurs before
termination, potentially avoiding payment for the last hour.

The choice of an exact bid value can be empirically derived 
from a number of factors, including observations of price 
history, the willingness of the user to run instances at less 
than a certain price or not run at all, and a minimum reliability 
level required. These factors, when reflected in the bid value,
determine how likely the system is to meet time and cost
constraints.

In any case, the adoption of more aggressive bidding strate- 
gies can result in more failures, and potentially undermine the 
cost savings, as a result of frequent loss of work. Therefore, 
resource provisioning policies aimed at running computational 
jobs on spot instances must be accompanied by fault mitigation 
techniques, especially tailored for the features of cloud computing
spot instances. Notable features of spot instances may influence the
way fault tolerance works in this scenario. Most notably, an
hour-based billing granularity and non-payment of partial hours in
the case of failures guarantee that users pay only for the actual
progress of computation [10]. Additionally, given that providers,
such as Amazon, freely provide a history of price variations,
significantly more informed decisions can be made by observing past
behaviour.

B. Our contribution 

This paper proposes a resource provisioning strategy that 
addresses the problem of running computational jobs on inter- 
mittent VMs. Our main objective is to run applications in a fast 
and economical way, while tolerating sudden unavailability of 
virtual machines. We build upon our previous work [2], where
we demonstrated the viability of dynamically assembling 
virtual clusters exclusively composed of spot instances to run 
compute-intensive applications. 

Specifically, the contributions of this work are: 

• A multifaceted resource provisioning approach, that in- 
cludes novel mechanisms for maximizing reliability, 
while minimizing costs in a spot instances-based com- 
putational platform; 

• A bidding mechanism that aids the decision-making
process by estimating future spot prices and making 
informed bidding decisions; 

• An evaluation of two novel fault tolerance techniques, 
namely migration and job duplication, and their compar- 
ison to an existing checkpointing-based approach. 
Fig. 1. Modeled architecture: Client (broker) and server (cloud) side. The "Runtime estimation" component was the focus of our previous work [2]. Here,
we focus primarily on the "fault tolerance" component.



The rest of this paper is organized as follows: Section II describes
related literature on existing approaches that use spot instances;
Section III describes our existing resource provisioning policy and
discusses the modifications necessary to add a reliability component
to it; Section IV details our multifaceted approach and discusses
each mechanism and the interaction between them; Section V presents
extensive simulation-based experimental results and their
discussion; finally, Section VI concludes the paper.



II. Related Work 

A few recently published works have touched on the subject of
leveraging variable pricing cloud resources in high-performance
computing. Andrzejak et al. [6] have proposed a
probabilistic decision model to help users decide how much to 
bid for a certain spot instance type in order to meet a certain 
monetary budget or a deadline. The model suggests bid values 
based on the probability of failures calculated using a mean 
of past prices from Amazon EC2. It can then estimate, with a 
given confidence, values for a budget and a deadline that can 
be achieved if the given bid is used. 

Yi et al. [10] proposed a method to reduce the cost of
computations and provide fault tolerance when using EC2
spot instances. Based on the price history, they simulated
how several checkpointing policies would perform when faced 
with out-of-bid situations. The proposed policies used two 
distinct techniques for deciding when to checkpoint a running 
program: at hour boundaries and at price rising edges. In 
the hour boundary scheme, checkpoints are taken periodically 
every hour, while in the rising edge scheme, checkpoints 
are taken when the spot price for a given instance type 
is increasing. The authors proposed combinations of the 
above mentioned schemes, including adaptive decisions, such 
as taking or skipping checkpointing at certain times. Their 
evaluation has shown that checkpointing schemes, in spite 
of the inherent overhead, can tolerate instance failures while 
reducing the price paid, as compared to normal on-demand 
instances. Similarly, we evaluate a checkpointing mechanism
implemented according to this work, with the objective of
comparing it with other fault tolerance approaches.



III. Resource Provisioning in a Spot Instances-based Computational Platform

In our previous work, we have proposed a resource pro- 
visioning and job allocation architecture and an associated 
policy. Our solution has been tailored for an organization that
aims at assembling a computational platform solely based
on spot instances and using it to accomplish a stream of
deadline-constrained computational jobs. In that work, we also
evaluated several runtime estimation mechanisms and their
effect on cost and utilization of the platform, as well as
deadline violations of jobs. In this section, we summarize how
our solution works; a detailed description and analysis can be
found in [2].

A Broker component is responsible for receiving computa- 
tional job requests from users, provisioning a suitable VM pool 
by interacting with the provider, and applying a job scheduling 
policy to ensure jobs finish within their deadlines, while 
minimizing the cost. A diagram depicting the components of 
the modeled architecture is shown in Figure 1.

We have modeled a cloud computing provider according to 
how Amazon EC2 currently works in practice. The provider 
manages a computational cloud, formed by one or more 
datacenters, which offer virtual machines of predefined types 
in a spot market. The provisioning of an instance is subject 
to the following characteristics: clients submit requests for a
single instance, specifying a type and the maximum they are
willing to pay per instance/hour (bid). Optionally, a particular
datacenter can be specified; if left blank, the provider allocates
the instance to the most economical datacenter choice. The
system provides instances whenever the bid is greater than
the current price; on the other hand, it terminates instances
without any notice when a client's bid is less than or equal to
the current price. The system does not charge the last partial
hour when it stops an instance, but it charges the last partial 
hour when the termination is initiated by the client (the price 
of a partial hour is considered the same as a full hour). The 
price of each instance/hour is the spot price at the beginning 
of the hour. 
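
The availability and billing rules of the modeled provider can be
summarized in a short Python sketch. The function names and the input
format (one spot price per instance hour, sampled at the start of that
hour) are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def is_running(bid: float, spot_price: float) -> bool:
    """An instance runs only while the bid is strictly above the spot price."""
    return bid > spot_price


def lease_cost(
    hour_start_prices: Sequence[float],
    runtime_hours: float,
    terminated_by_provider: bool,
) -> float:
    """Cost of one lease under the billing rules of the modeled provider.

    hour_start_prices[i] is the spot price at the start of the i-th
    instance hour (hypothetical input format).
    """
    if terminated_by_provider:
        # Out-of-bid termination: the last partial hour is not charged.
        billed_hours = math.floor(runtime_hours)
    else:
        # Client-initiated termination: a partial hour costs a full hour.
        billed_hours = math.ceil(runtime_hours)
    if billed_hours > len(hour_start_prices):
        raise ValueError("need one price per billed hour")
    return sum(hour_start_prices[:billed_hours])
```

An instance that ran for 2.5 hours is billed for two hours if the
provider terminated it, and for three hours if the client did.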

Jobs are assumed to be moldable, in the sense that they can run on
any number of CPU cores, but limited to a single virtual machine. To
determine the run time of a job on a particular number of CPU cores,
we use Downey's analytical model for job speedup [11]. To generate
values for A (average parallelism) and σ (coefficient of variance of
parallelism), we have used the model of Cirne & Berman [12]. The
moldability of a job defines its preferred instance type, i.e. the
type on which the job will take advantage of the largest number of
cores for a time greater than 1 hour. As a result, longer jobs that
offer more parallelism will prefer instances with more cores.
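
To make the role of the moldability parameters concrete, the sketch
below implements the speedup function of Downey's model as it is
commonly stated in the literature. The formulas are not given in this
paper, so they should be treated as an assumption and checked against
[11]; `sigma` denotes the variance-of-parallelism parameter, and the
helper `runtime_on_cores` is a hypothetical name.

```python
def downey_speedup(n: float, A: float, sigma: float) -> float:
    """Speedup on n cores for average parallelism A and variance sigma."""
    if n < 1 or A < 1 or sigma < 0:
        raise ValueError("expected n >= 1, A >= 1 and sigma >= 0")
    if sigma <= 1:
        # Low-variance model: near-linear growth, then saturation at A.
        if n <= A:
            return (A * n) / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return (A * n) / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return A
    # High-variance model.
    if n <= A + A * sigma - sigma:
        return (n * A * (sigma + 1)) / (sigma * (n + A - 1) + A)
    return A


def runtime_on_cores(sequential_runtime: float, n: int, A: float, sigma: float) -> float:
    """Estimated runtime on n cores, given the single-core runtime."""
    return sequential_runtime / downey_speedup(n, A, sigma)
```

Under this model the speedup never exceeds A, which is why jobs with
low average parallelism gain little from instance types with many
cores.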

The activities of our proposed algorithm are summarized in 
the steps described below. 

• When any job is submitted, it is inserted into a list of 
unscheduled jobs; 

• At regular intervals (T), the algorithm uses a runtime 
estimation method to predict the approximate runtime of 
the job on each available instance type; 

• The broker then attempts to allocate the job to an idle 
VM with enough time before a whole hour finishes; 

• If unsuccessful, it attempts to allocate the job to a VM 
that is currently running jobs but is expected to become 
idle soon. Runtime estimates of all jobs running on the 
VM, in addition to the incoming job, are required at this 
step; 

• If the job still cannot be allocated, the algorithm will 
decide whether it is advantageous to extend a current 
lease, to start a new VM lease, or to postpone the 
allocation decision according to the job's urgency factor 
and pricing conditions. 

The urgency factor U_j of a job j is the maximum estimated time the
job can wait for a resource to be provisioned so that the chance of
meeting the deadline is increased. It is computed as per Equation 1,
where D_j is the job's deadline; T is the current time, so that
D_j - T corresponds to the time until the job's deadline; α is the
urgency modifier; e_j is the estimated runtime of j on its preferred
instance type; and B is the expected time the provider takes to
provision a new VM (fixed at 5 minutes).

U_j = \max\{0,\ D_j - T - (\alpha \cdot e_j + B)\}    (1)

The greater the value of the α modifier, the more conservative the
algorithm becomes, i.e. with higher values of α, U_j approaches 0. A
value of U_j equal to 0 indicates that a resource must be provisioned
immediately to complete the job within the deadline. Alternatively,
lower values of α cause the algorithm to postpone more provisioning
actions in order to maximise the chances of finding lower prices or
reusing other jobs' instances.
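
A minimal sketch of the urgency factor computation follows, assuming
that all times are expressed in seconds and that the estimated
runtime is scaled by the urgency modifier before the provisioning
delay B is added. The function and parameter names are illustrative.

```python
def urgency_factor(
    deadline: float,
    now: float,
    est_runtime: float,
    alpha: float,
    provisioning_delay: float = 300.0,  # B, fixed at 5 minutes in the text
) -> float:
    """Maximum time (seconds) a job can wait before a VM must be provisioned."""
    slack = deadline - now - (alpha * est_runtime + provisioning_delay)
    # A job with no slack left is maximally urgent (factor 0).
    return max(0.0, slack)
```

For a job due in 10,000 seconds with an estimated runtime of 1,800
seconds, a modifier of 2 leaves 6,100 seconds of waiting time, whereas
a modifier of 20 yields an urgency factor of 0, i.e. immediate
provisioning.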

IV. Mechanisms to Achieve Fault Tolerance 

In this work, we explore a multifaceted approach, which relies on two
interrelated modalities that define how reliably the policy ensures
that computational jobs finish before their deadlines. The first
mechanism aims at choosing appropriate bid values based on estimation
of price variations and on the job's urgency factor U, which
influences the choice of when to provision a resource for a given job
and how much to bid. The second mechanism adds extra levels of fault
tolerance through checkpointing and migration of virtual machines, as
well as job duplication.

TABLE I
Evaluated bidding strategies

Bidding strategy   Bid value definition
Minimum            The minimum value observed in the price history + G
Mean               The mean of all values in the price history
On-demand          The listed on-demand price
High               A value much greater than any price observed (defined as 100)
Current            The current spot price + G

These mechanisms aim at mitigating spot instance unavail- 
ability due to out-of-bid situations only, i.e. failures due to 
price variations. Other types of instance failures, for instance, 
due to hardware faults or network interruptions are not consid- 
ered. In other words, we assume that, if no out-of-bid situation 
takes place during an instance lifetime, its availability is 100%. 

A. Bidding strategies: estimating cost and jobs' urgency 

The first mechanism comprises bidding strategies and the 
calculation of the value of U . These are based on estimated 
price variations and job runtimes. More specifically, this mech- 
anism aims to aid the process in two ways: (1) allow the broker 
to make informed decisions on how much to bid, a choice that 
directly influences the risk of failure and monetary spending; 
and (2) combine price information and a job's urgency factor, 
to decide the best point in time to start a new machine for a job, 
thus seeking to cover the period that will yield the minimum 
cost. The rationale behind combining these two pieces of
information is to avoid hasty decisions that may increase costs,
i.e. to avoid commissioning new resources too early, at times
when non-urgent jobs can be postponed, or too late, when jobs
will most likely miss their deadlines.

In our previous work [2], we compared several runtime estimation
policies and their impact on cost, deadline violations, and system
utilization. A simple mechanism that computes the average runtime of
the two preceding jobs of the same user performed consistently well.
Therefore, in this work, we exclusively employ that technique.
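
The runtime estimation technique can be sketched as follows. The
class name, the per-user history keyed by a user identifier, and the
fallback value returned for users without history are assumptions
made for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class RecentAverageEstimator:
    """Predict a job's runtime as the mean of the user's last two runtimes."""

    def __init__(self, window: int = 2) -> None:
        # Each user keeps only the most recent `window` runtimes.
        self._history: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def record(self, user: str, runtime: float) -> None:
        """Store the actual runtime of a finished job."""
        self._history[user].append(runtime)

    def estimate(self, user: str, fallback: Optional[float] = None) -> Optional[float]:
        """Return the average of the stored runtimes, or the fallback."""
        past = self._history.get(user)
        if not past:
            return fallback
        return sum(past) / len(past)
```

After a user's jobs ran for 100, 200 and 400 seconds, the next
estimate is 300 seconds, because only the two most recent runtimes
are kept.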

We have evaluated five bidding strategies, which are listed in
Table I. Two of the strategies use historical information to
compute the bid. In all cases, a window of one week's worth
of price history, individual to each instance type/OS/datacenter
combination, is fed to the bidding strategy. The output of each
strategy is the maximum price, in US dollars per hour, to be
paid for one particular instance. The minimum bid granularity
G is 0.001.
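
The five strategies of Table I translate directly into code. In the
sketch below, the function signature and the lowercase strategy
labels are illustrative; the constants G and 100 are taken from the
text.

```python
from statistics import mean
from typing import Sequence

G = 0.001         # minimum bid granularity, USD per hour (from the text)
HIGH_BID = 100.0  # value used by the "High" strategy in Table I


def compute_bid(
    strategy: str,
    price_history: Sequence[float],
    current_price: float,
    on_demand_price: float,
) -> float:
    """Return the bid, in USD per hour, for one of the five strategies."""
    if strategy == "minimum":
        return min(price_history) + G
    if strategy == "mean":
        return mean(price_history)
    if strategy == "on-demand":
        return on_demand_price
    if strategy == "high":
        return HIGH_BID
    if strategy == "current":
        return current_price + G
    raise ValueError(f"unknown bidding strategy: {strategy!r}")
```

Here `price_history` stands for the one-week window of prices for a
single instance type/OS/datacenter combination.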

In all cases that can yield values lower than the current price,
the broker uses the value of U to override the bid value, if
necessary. Specifically, it applies the steps of Algorithm 1.
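
The bid check of Algorithm 1 can be sketched in Python as shown
below. The sketch assumes that the override applies only when the
urgency factor is zero and the computed bid would otherwise be
out-of-bid, and that a non-urgent job schedules a new check at time
T + U. The `schedule_check` callback is a hypothetical hook into the
broker's event queue.

```python
from typing import Callable

G = 0.001  # minimum bid granularity, USD per hour (from the text)


def bid_check(
    bid: float,
    urgency: float,
    current_price: float,
    now: float,
    schedule_check: Callable[[float], None],
) -> float:
    """Override the bid for urgent jobs, or schedule a later check."""
    if urgency <= 0:
        # The job cannot wait: make sure the request is in-bid right now.
        if bid <= current_price:
            bid = current_price + G
    else:
        # The job can wait: look at prices again when the slack runs out.
        schedule_check(now + urgency)
    return bid
```

Passing the scheduler as a callback keeps the sketch independent of
any particular simulator or event loop.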

b ← compute bid;
U ← compute urgency factor;
P ← query provider for current price;
if U = 0 then
    if b ≤ P then
        b = P + G;
else
    schedule a bid check at T + U;

Algorithm 1: Bid check algorithm, which overrides the bid value or
schedules a new check in the future

B. Hourly Checkpointing

Checkpointing consists of saving the state of a VM, application, or
process during execution and restoring the saved state after a
failure to reduce the amount of lost work [13]. In the context of
virtual machines, the action of encapsulating execution and user
customization state is a commonplace feature in most virtual machine
monitors (VMM) [14]. Saving a VM state consists of serializing its
entire memory contents to persistent storage, thus including all
running applications and processes [15]. In our work, we assume that
checkpointing a running application is the same as saving the state
of an entire VM. The advantage of relying on VMM-supported
checkpointing is that applications do not need to be modified to
enable checkpointing-based fault tolerance. However, it is necessary
that cloud computing providers explicitly support such an operation.

The technique considered in this work is hourly VM
checkpointing, where states are saved at hour boundaries. This
technique has been previously identified by Yi et al. [10] as the
simplest and most intuitive, yet effective, form of dealing with 
the cost/reliability trade-off when running applications on spot 
instances. More specifically, taking a checkpoint on an hourly 
basis guarantees that only useful computational time is paid, 
given that spot instances are billed at an hour granularity and 
partial hours, in the case of failures, are not charged. 

In this method, it is assumed that a checkpointed VM will 
only resume when the original spot request, which has a fixed 
bid and machine type, is in-bid again. No attempt is made to 
provision a new VM by submitting higher bids for the same 
machine type, or to bid for other types. This contrasts with 
our next solution, which considers relocating the saved state 
to a new space in order to hasten job completion. 

C. Migration of persistent VM state 

We propose a migration-based fault tolerance mechanism in
which the state of a VM is frequently saved on a global filesystem
and, upon an out-of-bid situation, the state is relocated.
The migration technique is very similar to checkpointing, as
it consists of taking a snapshot of the VM and using it to
restore the computation upon a failure. But instead of waiting
for the original request to be in-bid again, the algorithm aims 
to lease a new instance under new terms, and then restore the 
saved VM state into the new instance. 

The definition of new lease terms is subject to the following
decision (whichever is estimated to be cheaper to accomplish the
remaining duration of the job): (1) leasing an instance of the same
type for a higher price in the same datacenter; (2) leasing an
instance of a different type in the same datacenter; or (3)
relocating the workload to another datacenter where a suitable VM may
be leased for a cheaper price. The overhead of restoring a failed VM
in a distinct datacenter is assumed to be higher than when the same
datacenter is chosen. This overhead is taken into account by the
algorithm when making a relocation decision.
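
The relocation decision can be illustrated with a small sketch that
picks the cheapest candidate lease. The cost model used here (price
per hour multiplied by the number of started hours, with the restore
overhead added to the remaining runtime) is an assumption based on
the hourly billing described earlier, and all names and figures are
hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LeaseOption:
    label: str
    price_per_hour: float          # expected price of the new lease
    remaining_hours: float         # remaining job runtime on this type
    restore_overhead_hours: float  # time to restore the saved VM state

    def estimated_cost(self) -> float:
        # Every started hour is billed in full (assumed cost model).
        billed = math.ceil(self.remaining_hours + self.restore_overhead_hours)
        return billed * self.price_per_hour


def cheapest_option(options: Iterable[LeaseOption]) -> LeaseOption:
    """Pick the lease terms with the lowest estimated cost."""
    return min(options, key=LeaseOption.estimated_cost)
```

A faster instance type can win despite a higher hourly price if it
shortens the remaining runtime enough, and a remote datacenter can win
despite its larger restore overhead if its price is sufficiently low.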

All computation in the VM is paused while the snapshot is being
taken. The overhead of saving an instance state (the same as taking a
checkpoint) is defined as the time to serialize a VM's memory
snapshot into a file in a global filesystem. This value is different
for each instance type, according to their maximum memory size. The
exact values are computed as in the work of Sotomayor et al. [16],
which provides a comprehensive model to predict the time to suspend
and resume VMs. The times to suspend (i.e. save the state) and to
resume (i.e. restore from the latest saved state) a spot instance
with m MB of memory are defined as per Equations 2 and 3,
respectively [16].

t_s = m / s    (2)

t_r = m / r    (3)

Values for s and r (rates, in MB/s, to write/read m MB of memory
to/from a global filesystem) are also taken from [16], whose authors
obtained them from numerous experiments on a realistic testbed.
Therefore, s is 63.67 MB/s, and r is 81.27 MB/s (to restore a state
in the same datacenter). We assume half the rate (40.64 MB/s) when
moving/restoring a VM state into/from a distinct datacenter.
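
The suspend and resume overheads follow directly from the two rates.
The sketch below uses the rates quoted above; applying the halved
rate only to the restore step is one possible reading of the
cross-datacenter assumption, and the function names are illustrative.

```python
SAVE_RATE_MB_S = 63.67     # s: rate of writing memory to the global filesystem
RESTORE_RATE_MB_S = 81.27  # r: rate of reading it back in the same datacenter


def suspend_time(memory_mb: float) -> float:
    """Seconds needed to save a VM with memory_mb MB of memory (m / s)."""
    return memory_mb / SAVE_RATE_MB_S


def resume_time(memory_mb: float, same_datacenter: bool = True) -> float:
    """Seconds needed to restore a saved VM state (m / r)."""
    # Assumption: the restore rate is halved across datacenters.
    rate = RESTORE_RATE_MB_S if same_datacenter else RESTORE_RATE_MB_S / 2
    return memory_mb / rate
```

Because both overheads grow linearly with memory size, instance types
with more memory pay proportionally more for every checkpoint and
every restore.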

D. Duplication of long jobs 

We also propose a fault tolerance mechanism that does not
require any application- or provider-assisted technique, as is
the case with VM-based checkpointing and migration. With task
duplication, we aim to evaluate a simpler method for rapid
deployment of applications on spot instances using currently
available cloud computing features.

Similar to replication and migration, duplication of work 
aims to increase the chance of success in meeting deadlines 
when running longer jobs (greater than one hour) over a 
period of frequent price changes. Therefore, a duplication- 
based technique was implemented and evaluated. 

This technique also relies on estimates of job runtimes. It
creates one replica of each job that is expected to run for more
than 1 hour. The replica is submitted to the same scheduling
policy as the original job. The algorithm applies the same
rules as it does to a regular job, but avoids choosing the
datacenter/type combination where the original job will run.
Choosing a different combination for a replica is an obvious
choice, since two jobs running in the same datacenter, using
the same instance type, will certainly fail at the same time
when the price increases.




TABLE II
Factors and their levels

Factor                      Possible values
Bidding strategy            Minimum, Mean, On-demand, High, Current
α                           1, 2, 4, 8, 10, 20
Fault tolerance mechanism   None, Migration, Checkpointing, Job duplication



V. Performance Evaluation 

In this section, we evaluate the proposed fault-aware re- 
source allocation policy and the effect of its mechanisms, 
using trace-driven discrete event simulations. We quantify the 
performance of our policy based on three metrics, two absolute 
(monetary cost and deadline violations) and one relative (dollar 
per useful computation). We especially observe the interaction
between these metrics, given that there is a known trade-off
between them, i.e. ensuring fewer violations usually means
provisioning more resources, hence higher costs.

A. Experimental design 

We have designed our experiments to study the influence
of the following factors and their levels: (1) bidding strategy;
(2) the value of the urgency factor modifier α; and (3) choice
of fault tolerance mechanism. The factors and their levels are
listed in Table II.

Not all combinations of factors have been simulated; for
example, there was little sense in combining the High bidding
strategy with a fault tolerance mechanism, given that the
bidding strategy itself completely avoids failures. In total, 5952
experiments were executed. All values presented correspond
to an average of 31 simulation runs. When available, error bars
correspond to a 95% confidence interval. The simulator was
implemented using the CloudSim framework [17].

Cloud characteristics: We modeled the cloud provider after the
features of Amazon EC2's US-EAST geographic region, which contains 4
datacenters. Instance types were modeled directly after the
characteristics of available standard and high-CPU types. The types
available to be used are M1.SMALL (1 ECU), M1.LARGE (5 ECUs),
M1.XLARGE (8 ECUs), C1.MEDIUM (5 ECUs), and C1.XLARGE (20 ECUs). One
ECU (EC2 Compute Unit) is defined as equivalent to the power of a
1.0-1.2 GHz 2007 AMD Opteron or 2007 Intel Xeon processor. A period
of 100 days' worth of pricing history traces has been collected,
comprising dates between July 5th, 2011 and October 15th, 2011. These
dates correspond to the available traces since Amazon EC2 started
offering distinct prices per individual datacenter, rather than per
geographic region.

Workload: The chosen job stream was obtained from the LHC Grid at
CERN [18], and is composed of grid-like embarrassingly parallel
tasks. A total of 100,000 jobs are submitted over a period of seven
days of simulation time, starting from a randomly generated time
within the available price history. This workload is suitable for our
experiments due to its bursty nature and its highly variable job
lengths. These features require a highly dynamic computation platform
that must serve variable loads while maintaining cost efficiency. The
moldability parameters A and σ of each job are assumed to be known by
the broker.

Originally, this workload trace did not contain information 
about user-supplied job runtime estimates and deadlines. User 
runtime estimates were generated according to the model 
of Tzafrir et al. [19]. A job's maximum allowed runtime
corresponds to the runtime estimate multiplied by a random 
multiplier, uniformly generated between 1.5 and 4. Conse- 
quently, the job's deadline corresponds to its submission time 
plus its maximum allowed runtime. 
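
The deadline assignment described above amounts to a few lines of
code. The sketch assumes that times are expressed in seconds and that
a seeded random generator is passed in for reproducibility; the
function name is illustrative.

```python
import random


def assign_deadline(
    submit_time: float, runtime_estimate: float, rng: random.Random
) -> float:
    """Deadline = submission time + estimate * uniform multiplier in [1.5, 4]."""
    multiplier = rng.uniform(1.5, 4.0)
    max_allowed_runtime = runtime_estimate * multiplier
    return submit_time + max_allowed_runtime
```

For a job submitted at time 1,000 with an estimate of 600 seconds,
the resulting deadline always falls between 1,900 and 3,400.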

B. Effects of bidding strategies and urgency factor 

In order to understand how our bidding strategies work, independently
of fault tolerance mechanisms, we have evaluated their effectiveness
in a scenario where a failed job must be restarted from the beginning
after a failure. In this experiment, we aimed at quantifying each
strategy's performance when paired with different values of the
urgency modifier α.

(a) Migration (b) Checkpointing (c) Job duplication

Fig. 3. Performance of migration, checkpointing and job duplication on monetary cost

Figure 2 shows the effect of the most aggressive (a = 1) to the 
most conservative (a = 20) urgency estimations under various 
bidding strategies. In these circumstances, bidding strategies 
that produce higher bids tend to perform better, both in terms 
of cost and deadline violations. In particular, we have observed 
that the On-demand strategy avoids failures due to minor 
price increases, while also avoiding the cost of 
prices above the on-demand price. This can be seen in 
the performance comparison between On-demand and High, 
which incurs extra cost due to its very high bid. As expected, 
strategies that aim at bidding low values experience the most 
failures, hence more loss of work, and consequently higher 
costs. 
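
To illustrate why low bids can end up costing more when a failed job restarts from the beginning, the following sketch replays a job against an hourly price trace. It is a simplified model under stated assumptions, not the simulator used in this paper: the function `run_with_restart`, the price trace, and the bid values are hypothetical, prices change only on hour boundaries, and every in-bid hour is charged at the spot price even if its work is later lost.

```python
def run_with_restart(prices, bid, job_hours):
    """Simulate one job on a spot instance with no fault tolerance.

    prices    -- spot price for each consecutive hour (hypothetical trace)
    bid       -- maximum price the user is willing to pay per hour
    job_hours -- uninterrupted hours the job needs to complete

    Returns (total_cost, failures, finish_hour); finish_hour is None
    if the job does not complete within the trace.
    """
    cost = 0.0
    failures = 0
    progress = 0
    for hour, price in enumerate(prices):
        if price > bid:
            # Out-of-bid: the instance is lost and all progress is discarded.
            if progress > 0:
                failures += 1
                progress = 0
            continue
        cost += price  # assumption: each in-bid hour is paid at the spot price
        progress += 1
        if progress == job_hours:
            return cost, failures, hour + 1
    return cost, failures, None


# Made-up hourly prices with a minor increase (0.05) and a spike (0.20).
prices = [0.03, 0.03, 0.05, 0.03, 0.03, 0.03, 0.20, 0.03]
low_bid_result = run_with_restart(prices, bid=0.04, job_hours=4)
higher_bid_result = run_with_restart(prices, bid=0.085, job_hours=4)
print(low_bid_result, higher_bid_result)
```

On this made-up trace, the lower bid is interrupted twice and never completes the four-hour job while still paying for lost work, whereas the higher bid rides out the minor increase and finishes without interruption.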

The value of a significantly influences both cost and deadline 
violations, consistently over all bidding strategies. Figure 
2(a) indicates an optimal value of 2, which yields the lowest 
costs, although for most bidding strategies the difference 
between 1 and 2 is not statistically significant. Regarding 
the deadline metric, values of 1 and 2 lead to many more 
deadline violations. This is because lower values of a 
cause the algorithm to postpone more decisions, which in turn 
often leaves it unable to provision resources "at the 
last minute". Conservative values, on the other hand, lead to 
virtually no violations, but significantly higher costs. 

C. Migration, checkpointing, and job duplication 

Our results also demonstrate the positive effects of the 
studied fault tolerance mechanisms when paired with bidding 
strategies and urgency factor estimation. Figure 3 shows a 
comparison of migration, checkpointing and job duplication on 
the cost metric. We only show values of a of 2, 4 and 8, 
which yield the best costs in all cases. An interesting fact 
is that migration performs better when paired with bidding 
strategies that choose lower bid values, such as Minimum and 
Current, while checkpointing benefits from higher bid values, 
such as Mean and On-demand. This behaviour is consistent 
with the features of each mechanism. Migration tends to have 
more choices after an out-of-bid situation, given its ability to 
choose other types of instances from multiple datacenters. 
Checkpointing, on the other hand, is bound to a persistent 
request, and will benefit from a higher chance of being in-bid 
most of the time. 

Job duplication performs poorly in all cases, yielding much 
higher costs when compared to the case when no fault 
tolerance exists. Its merit, however, lies in its simplicity and 
the capability of replicating jobs across multiple datacenters. 
Therefore, it can be useful in cases where an extra level of 
redundancy is required. 

Figure 4 presents a summary of the best combinations of 
strategies discovered in our simulations. Overall, the migration 
technique, along with the Minimum bidding strategy and 
a = 2, produced the lowest cost. However, a = 8 produced 
the fewest deadline violations (30 out of 100,000 
jobs). These results confirm that the trade-off between cost 
and deadline violations applies in this case. 

Fig. 4. Most economical combinations of bidding strategy and fault tolerance mechanism 

In summary, our results demonstrate that the interaction of 
factors can influence the exact choice of bidding strategy, a, 
and fault tolerance mechanism. It is expected that, in absolute 
terms, more conservative urgency factors will lead to fewer 
deadline violations and a greater cost. To provide a more 
precise measure, we define dollars per useful computation as 
the ratio between the total cost and the number of jobs that 
finished within their deadlines. Table ranks the 10 best 
factor combinations according to this metric. The combinations 
that employ migration consistently rank highest, which makes 
these combinations good candidates for environments where 
deadlines must be strictly met. 
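
The dollars per useful computation metric defined above can be computed as in the sketch below. The function name, the input layout, and the example numbers are assumptions made for illustration; in particular, returning infinity when no job meets its deadline is a convention chosen here, not one stated in the paper.

```python
def dollars_per_useful_computation(total_cost, finish_times, deadlines):
    """Ratio between total cost and the number of jobs finished on time.

    finish_times -- completion time of each job, or None if it never finished
    deadlines    -- deadline of each job, in the same time unit
    """
    on_time = sum(
        1
        for finish, deadline in zip(finish_times, deadlines)
        if finish is not None and finish <= deadline
    )
    if on_time == 0:
        # Assumed convention: no useful work means an unbounded cost per job.
        return float("inf")
    return total_cost / on_time


# Hypothetical run: four jobs, two of which meet their deadlines.
metric = dollars_per_useful_computation(
    total_cost=12.0,
    finish_times=[10, 25, None, 40],
    deadlines=[15, 20, 30, 40],
)
print(metric)
```

Because late and unfinished jobs still contribute to the total cost but not to the denominator, a configuration that is cheap but misses many deadlines is penalised by this metric.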

VI. Conclusions and Future Work 

In this work, we have proposed a multifaceted resource 
provisioning policy that reliably manages a pool of intermittent 
spot instances. Our policy contains multiple mechanisms, 
including 5 bidding strategies, an adjustable urgency factor 
estimator, and 3 fault-tolerance approaches. 

We have performed extensive simulations under realistic 
conditions that reflect the behaviour of Amazon EC2, via a 
history of its prices. Our results demonstrate that both cost 
savings and stricter adherence to deadlines can be achieved 
when properly combining and tuning the policy mechanisms. 
In particular, the fault tolerance mechanism that employs 
migration of VM state provides superior results in virtually all 
metrics. 

Currently, the cloud computing spot market is still in its 
infancy. Therefore, many challenges have not been encoun- 
tered, given the short history and relatively low variability of 
Amazon EC2 prices. In this sense, we plan to further improve 
our policy by devising bidding strategies that will perform 
well in environments with highly variable price levels and 
more frequent changes. We expect fault tolerance to be even 
more crucial in such scenarios. We also plan to lean towards 
provider-centric research, by studying the challenges involved 
in setting spot prices under various demand patterns. 